Getting started

How to begin a script

Always set the directory where you will be working, and load the packages (libraries) that you will need.

rm(list=ls()) # clear all variables in the memory
setwd(path_to_directory) # set the directory

# Load the libraries that you are going to need
library(RColorBrewer) # nice colors
library(ISLR) # datasets from the book "An Introduction to Statistical Learning" (e.g. Auto)
cols <- brewer.pal(9, "Set1") # nice color palette

Session environment:

  • getwd() where am I?
  • ls() list R objects
  • dir() list files in the directory
  • rm(list=ls()) clear all variables in the memory
  • rm('variable_name') remove only one or a few specified variables

You can get help with:

  • help(ls) or ?ls
  • help(package = stats)

Google is your best friend: try also googling “<topic> in R”.

Useful tip: use Tab to autocomplete function names, to see their arguments, and to navigate through directories.

Basic math

Assignment operation:

a = 1
b <- 2
3 -> c
a; b; c
[1] 1
[1] 2
[1] 3

Note: the assignment operator <- can seem strange to some and natural to others. If you feel more comfortable, remember you can still use =. Do not confuse either of these with the equality operator ==.
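
For example, a minimal illustration of the difference (not part of the original examples):

a <- 5  # assignment: a now holds the value 5
a == 5  # equality test: compares and returns TRUE or FALSE
[1] TRUE
a == 6
[1] FALSE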

R can interpret expressions

x <- 3
x
[1] 3

and compute mathematical expressions

y <- 2 * x + 4
y
[1] 10

Special types are NA, NaN, +Inf, -Inf.

vec_na <- c(NA, 3, 4, +Inf)
sum(vec_na)
[1] NA
is.na(vec_na)
[1]  TRUE FALSE FALSE FALSE

Remember the effect of “forbidden” operations

+Inf - Inf
[1] NaN
3/0
[1] Inf
-3/0
[1] -Inf
0/0
[1] NaN

Here is a list of the syntax for common mathematical operations. This is not meant to be an exhaustive list.

Operator Description
+ addition
- subtraction
* element-wise multiplication
/ element-wise division
%*% matrix multiplication
^ exponentiation
%% modulo (remainder after division)
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== equal to
!= not equal to
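
A few of these operators in action (a minimal base-R illustration):

7 %% 3           # remainder of the integer division 7 / 3
[1] 1
2^3              # exponentiation
[1] 8
c(1, 2, 3) <= 2  # comparisons are applied element-wise
[1]  TRUE  TRUE FALSE
3 != 4
[1] TRUE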

Control flow

Let us define a vector:

z <- rep(1, 10) # repeats 1 ten times

We can loop over some variable using a for loop:

for (i in 3:10){
  z[i] <- z[i - 1] + z[i - 2]
}
z
 [1]  1  1  2  3  5  8 13 21 34 55

We can also branch the code according to some condition, using the if-else statement:

if ( (2 + 1) == 3 ) print("yes") else print("no")
[1] "yes"

Remember that indenting your code is important (some programming languages will even throw errors without it). Careful formatting is key to readable code.

for (i in 1:10){
  if (i == 5){
    cat(i, " ")
  }
  else{
    cat("Not five ")
  }
}
Not five Not five Not five Not five 5  Not five Not five Not five Not five Not five 

The statements if and for are sufficient for the vast majority of programs. However, in a few instances you might find other constructs more convenient, such as the while loop:

i <- 7
while (i){
  z[i] <- z[i] / 2
  i <- i - 1
  if (i < 3){
    break # break exits the while loop
  }
}
z # elements 3:7 have been divided by 2
 [1]  1.0  1.0  1.0  1.5  2.5  4.0  6.5 21.0 34.0 55.0

while is less common, but it is useful when the number of iterations is not known in advance.
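
As an illustration (a sketch, not from the original notes), here is a loop whose number of iterations is not known in advance: keep halving a number until it drops below 1.

x <- 100
n_steps <- 0
while (x > 1){
  x <- x / 2
  n_steps <- n_steps + 1
}
n_steps # number of halvings needed to fall below 1
[1] 7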

Read and write data

You can use read.csv for a .csv file, or read.table for a .txt file. Both return the data as a data frame (more on that later).

Auto <- read.csv("./Data/Auto.csv")
Auto[1:4,]
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
dim(Auto)
[1] 397   9
Auto <- na.omit(Auto) # removes NA from data frame
dim(Auto)
[1] 397   9
names(Auto)
[1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
[6] "acceleration" "year"         "origin"       "name"        

Conversely, write.csv can be used to write a data frame from R to a .csv file.
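
For example, a minimal sketch (the output file name is just an illustration):

write.csv(Auto, "./Data/Auto_clean.csv", row.names = FALSE) # row.names = FALSE avoids writing an extra index column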

Data types

Vectors

Vectors can be defined in several ways. One example is the function seq. It takes as input the bounds of the vector and the stepsize.

vec <- seq(1, 9, by = 0.1)
vec
 [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8
[20] 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7
[39] 4.8 4.9 5.0 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6
[58] 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4 8.5
[77] 8.6 8.7 8.8 8.9 9.0
vec <- 1:9 # equivalent to seq(1, 9, by = 1)
vec
[1] 1 2 3 4 5 6 7 8 9

Alternatively, it can take as input the bounds of the vector and its length.

vec <- seq(1, 9, length.out = 100) # useful for fine grids
vec
  [1] 1.000000 1.080808 1.161616 1.242424 1.323232 1.404040 1.484848 1.565657
  [9] 1.646465 1.727273 1.808081 1.888889 1.969697 2.050505 2.131313 2.212121
 [17] 2.292929 2.373737 2.454545 2.535354 2.616162 2.696970 2.777778 2.858586
 [25] 2.939394 3.020202 3.101010 3.181818 3.262626 3.343434 3.424242 3.505051
 [33] 3.585859 3.666667 3.747475 3.828283 3.909091 3.989899 4.070707 4.151515
 [41] 4.232323 4.313131 4.393939 4.474747 4.555556 4.636364 4.717172 4.797980
 [49] 4.878788 4.959596 5.040404 5.121212 5.202020 5.282828 5.363636 5.444444
 [57] 5.525253 5.606061 5.686869 5.767677 5.848485 5.929293 6.010101 6.090909
 [65] 6.171717 6.252525 6.333333 6.414141 6.494949 6.575758 6.656566 6.737374
 [73] 6.818182 6.898990 6.979798 7.060606 7.141414 7.222222 7.303030 7.383838
 [81] 7.464646 7.545455 7.626263 7.707071 7.787879 7.868687 7.949495 8.030303
 [89] 8.111111 8.191919 8.272727 8.353535 8.434343 8.515152 8.595960 8.676768
 [97] 8.757576 8.838384 8.919192 9.000000

One of the most used functions is c(), which stands for concatenate. You can concatenate vectors with vectors, scalars with scalars, or vectors with scalars.

A <- c(1, 6, 3)
B <- c(0, -2, 1)
vec <- c(A, 2, B)
vec
[1]  1  6  3  2  0 -2  1

And, of course, you can combine all the functions previously seen.

vec <- c(rep(1, 10), rep(2, 20))
vec
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Matrices and arrays

A <- matrix(1:16, 4, 4)
A
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Oops, maybe we actually wanted to fill the matrix by row!

A <- matrix(1:16, 4, 4, byrow = T)
A
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16

Fundamentally important functions are cbind() and rbind().

B <- matrix(runif(8), 4, 2, byrow = T)
cbind(A, B)
     [,1] [,2] [,3] [,4]      [,5]      [,6]
[1,]    1    2    3    4 0.4430378 0.7645879
[2,]    5    6    7    8 0.5332281 0.9026283
[3,]    9   10   11   12 0.2495218 0.9820944
[4,]   13   14   15   16 0.1327472 0.6736971
C <- matrix(runif(8), 2, 4, byrow = T)
rbind(A, C)
           [,1]       [,2]       [,3]       [,4]
[1,]  1.0000000  2.0000000  3.0000000  4.0000000
[2,]  5.0000000  6.0000000  7.0000000  8.0000000
[3,]  9.0000000 10.0000000 11.0000000 12.0000000
[4,] 13.0000000 14.0000000 15.0000000 16.0000000
[5,]  0.8161726  0.9740909  0.1279643  0.9867436
[6,]  0.5706480  0.5150818  0.8515889  0.7046885

We can also define arrays. These are generalizations of vectors and matrices to an arbitrary number of dimensions. A vector is a 1D array, a matrix is a 2D array. They are also called tensors.

D <- array(rnorm(2*4*5), dim = c(2, 4, 5))
D
, , 1

           [,1]      [,2]       [,3]        [,4]
[1,] -1.1494837 0.2753644 -0.6069892 -0.05844021
[2,]  0.1860709 0.6370736  0.8431723 -0.26942701

, , 2

          [,1]        [,2]       [,3]      [,4]
[1,] 0.4040653 -0.04382134 -0.2218624 0.1630243
[2,] 2.2414516  0.05005394 -1.1278691 0.4233376

, , 3

           [,1]       [,2]        [,3]       [,4]
[1,] -0.4949221 -0.5609446  1.75961921 -1.9711666
[2,]  0.7641867  0.5045475 -0.07788115 -0.5476058

, , 4

           [,1]      [,2]       [,3]       [,4]
[1,]  0.1495078 -1.636989 -1.7244224  0.2492688
[2,] -1.2017316 -1.202879  0.2139477 -1.0317071

, , 5

           [,1]       [,2]        [,3]        [,4]
[1,] -1.8509644 -0.7258099 -0.89236383  0.03814286
[2,]  0.2015316 -0.8681740 -0.08186036 -1.27583796

Indexing arrays

A[2,3] # selects the element in row 2, column 3
[1] 7
A[c(1,3),c(2,4)] # selects first and third row, second and fourth column
     [,1] [,2]
[1,]    2    4
[2,]   10   12
A[1:3,2:4]
     [,1] [,2] [,3]
[1,]    2    3    4
[2,]    6    7    8
[3,]   10   11   12
A[1:2,] # leaving the index after the comma blank selects all the columns
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
A[,1:2] # leaving the index before the comma blank selects all the rows
     [,1] [,2]
[1,]    1    2
[2,]    5    6
[3,]    9   10
[4,]   13   14
A[-c(1,3),] # removes first and third row
     [,1] [,2] [,3] [,4]
[1,]    5    6    7    8
[2,]   13   14   15   16
A[-c(1,3),-c(1,3,4)] # removes the specified rows and columns
[1]  6 14

Indexing generic arrays is done in the same way.

D[1,3,2]
[1] -0.2218624
D[,3,]
           [,1]       [,2]        [,3]       [,4]        [,5]
[1,] -0.6069892 -0.2218624  1.75961921 -1.7244224 -0.89236383
[2,]  0.8431723 -1.1278691 -0.07788115  0.2139477 -0.08186036

Dataframes

Data frames are simply lists of vectors. The vectors can be of different types (numeric, character, …). It is the most common format for data.

a <- 1:3
b <- factor(1:3)
c <- letters[1:3]
x <- data.frame(a = a, b = b, c = c)
x
  a b c
1 1 1 a
2 2 2 b
3 3 3 c

We can visualize the names of each variable.

names(x)
[1] "a" "b" "c"

Use list operators to extract columns.

x$a
[1] 1 2 3
x$b # a factor
[1] 1 2 3
Levels: 1 2 3
x[["c"]] # different factors
[1] "a" "b" "c"

Use matrix indexing, too!

x[1:2, 2:3]
  b c
1 1 a
2 2 b

Subset some rows and select the corresponding columns:

y <- subset(x, b %in% 2:3, select = c(b, c))
y
  b c
2 2 b
3 3 c

There are many newer packages for manipulating data frames: see reshape2, plyr, dplyr, sqldf, and others.
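
For instance, the subset() call above could be written with dplyr (a sketch, assuming the package is installed):

library(dplyr)
x %>% filter(b %in% 2:3) %>% select(b, c) # same rows and columns as y above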

Categorical data

We can define a vector of characters denoting the birthplace of some individuals.

birthplace <- c("Austin","Austin","Houston","Houston","Austin","Dallas","Houston","Austin","Dallas","Houston","Austin") 

Now we need to convert this vector to a factor (categorical vector), by explicitly providing all the possible values it can take.

birthplace <- factor(birthplace, levels = c('Austin','Houston','Dallas', 'San Antonio')) # note that in our fake data nobody was born in San Antonio
plot(birthplace) # barplot of the frequencies

abs_freq <- table(birthplace) 
abs_freq
birthplace
     Austin     Houston      Dallas San Antonio 
          5           4           2           0 
rel_freq <- table(birthplace)/length(birthplace) 
rel_freq
birthplace
     Austin     Houston      Dallas San Antonio 
  0.4545455   0.3636364   0.1818182   0.0000000 
round(rel_freq, 2)
birthplace
     Austin     Houston      Dallas San Antonio 
       0.45        0.36        0.18        0.00 

Matrix-vector operations

R does linear algebra very efficiently.

x <- 1:3 # [1, 2, 3]
x %*% x # inner product: equivalent to t(x) %*% x
     [,1]
[1,]   14
x %*% t(x) # outer product
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    2    4    6
[3,]    3    6    9

There are two functions that are more computationally efficient: crossprod(x) and tcrossprod(x). The first one computes t(x) %*% x and the second x %*% t(x).
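
For example, with the vector x defined above:

crossprod(x)  # t(x) %*% x: the inner product
     [,1]
[1,]   14
tcrossprod(x) # x %*% t(x): the outer product
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    2    4    6
[3,]    3    6    9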

A <- matrix(1:9, 3, 3, byrow = T)
A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
A %*% x # matrix - vector product
     [,1]
[1,]   14
[2,]   32
[3,]   50
A %*% A # matrix - matrix product
     [,1] [,2] [,3]
[1,]   30   36   42
[2,]   66   81   96
[3,]  102  126  150
eigen(A) # eigenvalue decomposition
eigen() decomposition
$values
[1]  1.611684e+01 -1.116844e+00 -1.303678e-15

$vectors
           [,1]        [,2]       [,3]
[1,] -0.2319707 -0.78583024  0.4082483
[2,] -0.5253221 -0.08675134 -0.8164966
[3,] -0.8186735  0.61232756  0.4082483

We can also compute the sum of the elements of a matrix

sum(A)
[1] 45

the sum along the rows

rowSums(A)
[1]  6 15 24

and the sum along the columns.

colSums(A)
[1] 12 15 18

For more general operations along rows and columns, use apply.

apply(X = A, MARGIN = 1, FUN = var) # along rows
[1] 1 1 1
apply(X = A, MARGIN = 2, FUN = var) # along columns
[1] 9 9 9

Advantages of doing linear algebra in R:

  • Much faster than explicit loops (see the sketch below)
  • Sparse matrices are available
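
As a quick, informal check of the first point, compare an explicit loop with its vectorized equivalent (timings vary from machine to machine, so no output is shown). The sparse-matrix sketch assumes the Matrix package is installed.

v <- rnorm(1e6)
system.time({ s <- 0; for (vi in v) s <- s + vi }) # explicit loop
system.time(sum(v))                                # vectorized: much faster

library(Matrix)
M <- sparseMatrix(i = c(1, 500), j = c(2, 1000), x = c(5, -2), dims = c(1000, 1000))
object.size(M) # only the non-zero entries are stored: far smaller than a dense 1000 x 1000 matrix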

Random number generation

We start with the normal distribution.

x <- rnorm(50) # with no additional arguments, rnorm draws from the standard normal distribution
y <- x + rnorm(50, mean = 50, sd = .1)
cor(x,y)
[1] 0.9959406
set.seed(123)
y <- rnorm(50)
cor(x,y)
[1] 0.02657538
mean(y)
[1] 0.03440355
var(y)
[1] 0.8572352
sqrt(var(y))
[1] 0.92587
sd(y)
[1] 0.92587

Note that rerunning the same lines gives different results, because these functions involve randomness. To fix the outcomes and get consistent results across different computers, we need to set the random number generator's seed with set.seed().
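
For example, resetting the seed reproduces exactly the same draws:

set.seed(123)
first_draw <- rnorm(3)
set.seed(123)           # same seed ...
second_draw <- rnorm(3)
identical(first_draw, second_draw) # ... same numbers
[1] TRUE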

We have also encountered the Bernoulli distribution.

x <- sample(x = 0:1, size = 1000, replace = TRUE, prob = c(0.5, 0.5))
barplot(table(x))

mean(x)
[1] 0.507
var(x)
[1] 0.2502012

Vectorized expressions

R expressions are vectorized.

x <- 1:5
x
[1] 1 2 3 4 5
y <- 2 * x
y
[1]  2  4  6  8 10

The rule is that the expression is evaluated for each element.

z <- 1:5
2 * z
[1]  2  4  6  8 10
for (i in 1:5){
  z[i] <- 2 * z[i]
}
z
[1]  2  4  6  8 10

Functions may or may not return a value for each element of an input vector.

sqrt(1:5)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
sum(1:5)
[1] 15
summary(1:5)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       2       3       3       4       5 

Functional programming

Writing functions is fundamental to well-organized code.

f <- function(a, b = 2, c = NULL){
  res <- a * b
  if (!is.null(c)) res <- res * c
  return(res)
}
f(1, 2, 3)
[1] 6

In the previous function definition, note the role of the default values.

f(4)
[1] 8

Let’s take a look at how R handles the scope of the functions. Scope refers to the visibility of variables.

c <- "mice" # global
f <- function(a = "three") # function formals
{
  b <- "blind" # function body
  return(paste(a, b, c)) # three scopes
}
f()
[1] "three blind mice"

If an object is not found in the current scope, R searches upward through the enclosing scopes (this can be dangerous).

x <- 2 # global scope
f <- function() x <- 2 * x # 2 different x's here
x
[1] 2
f()
x
[1] 2

The assignment in the function body creates a variable x whose scope is the function body, not outside of it.
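
As an aside (a minimal sketch, not in the original notes): if you really do want a function to modify a variable defined outside of it, R provides the superassignment operator <<-, which assigns into the enclosing (here, global) scope. Use it sparingly.

x <- 2
f <- function() x <<- 2 * x # <<- modifies the x in the enclosing scope
f()
x
[1] 4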

Graphics

Colours

RColorBrewer is an excellent package! But if you are nerdy enough, you can also consult this color Bible. Unfortunately the basic colors in R are pretty bad. See for example

plot(1:9, 1:9, pch = 16, col = 1:9)

You can also use different symbols to differentiate the points.

plot(1:20,1:20,pch=1:20)

Look how much nicer the RColorBrewer palettes are! Some of them are also recommended for color-blind viewers.

library(RColorBrewer)
display.brewer.all()

cols <- brewer.pal(9, "Set1")
plot(1:9, 1:9, pch = 16, col = cols[1:9])

Base plot

set.seed(123)
x <- rnorm(100) # 100 draws from a standard normal distribution
y <- rnorm(100)
plot(x, y, col = 'green') # you can also use the numbers to define the colors, as seen in the previous example 

plot(x, y, xlab = "this is the x-axis", ylab = "this is the y-axis", 
     main = "Plot of X vs Y", col = cols[3])

plot(x, y, col = cols[3], pch = 16)

We can also add a line with the command abline(), which takes the argument v = for a vertical line, h = for a horizontal line, or a = and b = for the intercept and slope, respectively.

plot(x, y, col = cols[3], pch = 16)
abline(h = 0, lty = 2, lwd = 2, col = cols[1])

Let us now see the use of two very useful plotting functions: lines() and hist(). First, we generate observations from a normal distribution using rnorm().

samp <- rnorm(n = 5000, mean = 3, sd = 2)
hist(samp, col = 'gray', border = 'white', probability = T, nclass = 30, main = 'Histogram', xlab = '')
x_grid <- seq(min(samp), max(samp), length.out = 100)
lines(x_grid, dnorm(x_grid, mean = 3, sd = 2), lwd = 2, col = cols[1])

Note how we overlay a curve on top of the histogram: we evaluate the function on a fine grid of points on the x axis, and connecting the points (x, f(x)) approximates the function!

plot(x_grid, dnorm(x_grid, mean = 3, sd = 2), lwd = 2, col = cols[1], 
     main = 'How to plot functions', xlab = 'x', ylab = 'f(x)')

Two dimensional plots are a bit more complex. For example, say that we want to compute the contour lines of the function \(f(x, y) = \cos(y) / (1 + x^{2})\), on the domain \(x \in [-\pi, \pi], y \in [-\pi, \pi]\). We can represent the function surface in different ways.

x <- seq(-pi, pi, length.out = 50)
y <- x
f <- outer(x, y, function(x,y){cos(y)/(1+x^2)})
contour(x, y, f)
contour(x, y, f, nlevels = 45, add = T)

fa <- (f - t(f))/2
contour(x, y, fa, nlevels = 15)

image(x, y, fa)

persp(x, y, fa)

persp(x, y, fa, theta = 30)

persp(x, y, fa, theta = 30, phi = 40)

ggplot2

The same things can be done with the awesome ggplot2 package. It is just a matter of taste: I personally prefer ggplot2, but I realize its syntax might feel cumbersome to some.

library(reshape2)
library(ggplot2)

set.seed(123)
x <- rnorm(100) # 100 draws from a standard normal distribution
y <- rnorm(100)

ggplot() + 
  geom_point(aes(x = x, y = y), col = cols[3]) + 
  labs(x = "X", y = "Y", title = "Plot of X vs Y")

x <- seq(-pi, pi, length.out = 50)
y <- x
f <- outer(x, y, function(x,y){cos(y)/(1+x^2)})

plot_f <- melt(f) # Var1 and Var2 are row/column indices (1 to 50), not the x and y values themselves

ggplot() + 
  geom_contour(aes(x = Var1, y = Var2, z = value), data = plot_f) + 
  labs(x = "x", y = "y") + 
  theme(legend.title=element_blank())

fa <- (f - t(f))/2
plot_fa <- melt(fa)

ggplot() + 
  geom_contour(aes(x = Var1, y = Var2, z = value), data = plot_fa) + 
  labs(x = "x", y = "y") + 
  theme(legend.title=element_blank())

ggplot() + 
  geom_raster(aes(x = Var1, y = Var2, fill = value), data = plot_fa) + 
  scale_fill_viridis_c(option = 'D') +
  labs(x = "x", y = "y") + 
  theme(legend.title=element_blank())

Exercise 1

Write an R function that returns the first \(n\) Fibonacci numbers.

Exercise 2

Write a script to plot the probability density functions of three random variables: \(X_{1} \sim N(0, 3^2)\), \(X_{2} \sim N(10, 0.5^2)\), \(X_{3} \sim N(-2, 1^2)\).

Exercise 3

This exercise has the goal of getting familiar with the central limit theorem (CLT). Suppose \(X_1, X_2, \dots, X_n\) is a sequence of i.i.d. random variables (not necessarily normal) with well-defined mean \(\mu\) and variance \(\sigma^2\). Then, for large \(n\), the sample mean is approximately normally distributed: \[\frac{X_1 + \dots + X_n}{n} \approx \mathcal{N}(\mu, \sigma^2 / n)\]

For this exercise, sample \(100\) observations from a Bernoulli experiment with parameter \(p = 0.2\). Now take the mean of this sample, using the function mean(). Can you repeat this process \(1000\) times? Remember how to use the for loop. Now take the \(1000\) generated sample means and plot them in a histogram. Is the shape familiar? Can you overlay the theoretical normal density that we expect from the CLT? You will have to remember the mean and variance of a Bernoulli experiment.